Introduction

Overview and Motivation

This project was inspired by the Kaggle competition Instacart Market Basket Analysis, which is also the source of our data sets. Instacart is a grocery ordering and delivery application. The company provides an anonymized dataset containing a sample of over 3 million grocery orders from more than 200,000 Instacart users. For each user, it provides between 4 and 100 orders, with the sequence of products purchased in each order, the day of the week and hour of day the order was placed, and a relative measure of time between orders (each data set is described in detail below).

Instacart hopes that competition participants will test models that predict which products a user will buy again, try for the first time, or add to the cart next during a session. Such predictions may call for models and tools such as XGBoost, word2vec and Annoy.

Predicting repurchases and predicting the day on which an order will be placed are popular and useful tasks for e-commerce companies. For example, Amazon has patented a method called “anticipatory shipping” that predicts what and when people want to buy and ships packages even before customers have placed an order. This allows the company to optimize logistics management, human and equipment resources and inventory arrangement, which helps to decrease costs and increase profit. However, this type of prediction requires much more information about customers’ behavior, such as the items customers have searched for, the amount of time a user’s cursor hovers over a product, the number of clicks, the rate at which users’ clicks, additions to cart and saved items (collections) convert into purchases, and so on.

In our case, since the available information is limited and we would like to apply the models we have learnt in the course, we choose to predict the day of the week on which an order will be placed. This prediction could then serve as an additional predictor for demand forecasting, which helps the e-commerce platform make better decisions, for example on inventory arrangement.

Research questions

Overall, we produce a new dataset based on the data downloaded from the competition website, and assume that:

  1. one order = one user (because of the data limitation mentioned above);
  2. we already know what customers will buy next, which means that the demand is already known.

Thus, our research questions will be:

  • On what day of the week will a given order be placed?
    For this question, we will use supervised methods - Classification Tree and Multiple Logistic Regression.

  • Are there any common components between departments or aisles?
    For this question, we will use unsupervised methods - PCA and Clustering.

Exploratory Data Analysis

Data Description

orders (3.4m rows, 206k users):
* order_id: order identifier
* user_id: customer identifier
* eval_set: which evaluation set this order belongs in (see SET described below)
* order_number: the order sequence number for this user (1 = first, n = nth)
* order_dow: the day of the week the order was placed on
* order_hour_of_day: the hour of the day the order was placed on
* days_since_prior: days since the last order, capped at 30 (with NAs for order_number = 1)

products (50k rows):
* product_id: product identifier
* product_name: name of the product
* aisle_id: foreign key
* department_id: foreign key

aisles (134 rows):
* aisle_id: aisle identifier
* aisle: the name of the aisle

departments (21 rows):
* department_id: department identifier
* department: the name of the department

order_products__SET (30m+ rows):
* order_id: foreign key
* product_id: foreign key
* add_to_cart_order: order in which each product was added to cart
* reordered: 1 if this product has been ordered by this user in the past, 0 otherwise

where SET is one of the three following evaluation sets (eval_set in orders):
* "prior": orders prior to that user's most recent order (~3.2m orders)
* "train": training data supplied to participants (~131k orders)
* "test": test data reserved for machine learning competitions (~75k orders)

Table 1 - aisles
#> [1] 0
The aisles table
aisle_id aisle
1 prepared soups salads
2 specialty cheeses
3 energy granola bars
4 instant foods
5 marinades meat preparation
6 other
7 packaged meat
8 bakery desserts
9 pasta sauce
10 kitchen supplies
11 cold flu allergy
12 fresh pasta
13 prepared meals
14 tofu meat alternatives
15 packaged seafood
16 fresh herbs
17 baking ingredients
18 bulk dried fruits vegetables
19 oils vinegars
20 oral hygiene
21 packaged cheese
22 hair care
23 popcorn jerky
24 fresh fruits
25 soap
26 coffee
27 beers coolers
28 red wines
29 honeys syrups nectars
30 latino foods
31 refrigerated
32 packaged produce
33 kosher foods
34 frozen meat seafood
35 poultry counter
36 butter
37 ice cream ice
38 frozen meals
39 seafood counter
40 dog food care
41 cat food care
42 frozen vegan vegetarian
43 buns rolls
44 eye ear care
45 candy chocolate
46 mint gum
47 vitamins supplements
48 breakfast bars pastries
49 packaged poultry
50 fruit vegetable snacks
51 preserved dips spreads
52 frozen breakfast
53 cream
54 paper goods
55 shave needs
56 diapers wipes
57 granola
58 frozen breads doughs
59 canned meals beans
60 trash bags liners
61 cookies cakes
62 white wines
63 grains rice dried goods
64 energy sports drinks
65 protein meal replacements
66 asian foods
67 fresh dips tapenades
68 bulk grains rice dried goods
69 soup broth bouillon
70 digestion
71 refrigerated pudding desserts
72 condiments
73 facial care
74 dish detergents
75 laundry
76 indian foods
77 soft drinks
78 crackers
79 frozen pizza
80 deodorants
81 canned jarred vegetables
82 baby accessories
83 fresh vegetables
84 milk
85 food storage
86 eggs
87 more household
88 spreads
89 salad dressing toppings
90 cocoa drink mixes
91 soy lactosefree
92 baby food formula
93 breakfast bakery
94 tea
95 canned meat seafood
96 lunch meat
97 baking supplies decor
98 juice nectars
99 canned fruit applesauce
100 missing
101 air fresheners candles
102 baby bath body care
103 ice cream toppings
104 spices seasonings
105 doughs gelatins bake mixes
106 hot dogs bacon sausage
107 chips pretzels
108 other creams cheeses
109 skin care
110 pickled goods olives
111 plates bowls cups flatware
112 bread
113 frozen juice
114 cleaning products
115 water seltzer sparkling water
116 frozen produce
117 nuts seeds dried fruit
118 first aid
119 frozen dessert
120 yogurt
121 cereal
122 meat counter
123 packaged vegetables fruits
124 spirits
125 trail mix snack mix
126 feminine care
127 body lotions soap
128 tortillas flat bread
129 frozen appetizers sides
130 hot cereal pancake mixes
131 dry pasta
132 beauty
133 muscles joints pain relief
134 specialty wines champagnes
Table 2 - departments
#> [1] 0
The departments table
department_id department
1 frozen
2 other
3 bakery
4 produce
5 alcohol
6 international
7 beverages
8 pets
9 dry goods pasta
10 bulk
11 personal care
12 meat seafood
13 pantry
14 breakfast
15 canned goods
16 dairy eggs
17 household
18 babies
19 snacks
20 deli
21 missing
Table 3 - products
#> [1] 0
The products table
product_id product_name aisle_id department_id
1 Chocolate Sandwich Cookies 61 19
2 All-Seasons Salt 104 13
3 Robust Golden Unsweetened Oolong Tea 94 7
4 Smart Ones Classic Favorites Mini Rigatoni With Vodka Cream Sauce 38 1
5 Green Chile Anytime Sauce 5 13
6 Dry Nose Oil 11 11
7 Pure Coconut Water With Orange 98 7
8 Cut Russet Potatoes Steam N’ Mash 116 1
9 Light Strawberry Blueberry Yogurt 120 16
10 Sparkling Orange Juice & Prickly Pear Beverage 115 7
11 Peach Mango Juice 31 7
12 Chocolate Fudge Layer Cake 119 1
13 Saline Nasal Mist 11 11
14 Fresh Scent Dishwasher Cleaner 74 17
15 Overnight Diapers Size 6 56 18
16 Mint Chocolate Flavored Syrup 103 19
17 Rendered Duck Fat 35 12
18 Pizza for One Suprema Frozen Pizza 79 1
19 Gluten Free Quinoa Three Cheese & Mushroom Blend 63 9
20 Pomegranate Cranberry & Aloe Vera Enrich Drink 98 7
21 Small & Medium Dental Dog Treats 40 8
22 Fresh Breath Oral Rinse Mild Mint 20 11
23 Organic Turkey Burgers 49 12
24 Tri-Vi-Sol® Vitamins A-C-and D Supplement Drops for Infants 47 11
25 Salted Caramel Lean Protein & Fiber Bar 3 19
26 Fancy Feast Trout Feast Flaked Wet Cat Food 41 8
27 Complete Spring Water Foaming Antibacterial Hand Wash 127 11
28 Wheat Chex Cereal 121 14
29 Fresh Cut Golden Sweet No Salt Added Whole Kernel Corn 81 15
30 Three Cheese Ziti, Marinara with Meatballs 38 1
31 White Pearl Onions 123 4
32 Nacho Cheese White Bean Chips 107 19
33 Organic Spaghetti Style Pasta 131 9
34 Peanut Butter Cereal 121 14
35 Italian Herb Porcini Mushrooms Chicken Sausage 106 12
36 Traditional Lasagna with Meat Sauce Savory Italian Recipes 38 1
37 Noodle Soup Mix With Chicken Broth 69 15
38 Ultra Antibacterial Dish Liquid 100 21
39 Daily Tangerine Citrus Flavored Beverage 64 7
40 Beef Hot Links Beef Smoked Sausage With Chile Peppers 106 12
41 Organic Sourdough Einkorn Crackers Rosemary 78 19
42 Biotin 1000 mcg 47 11
43 Organic Clementines 123 4
44 Sparkling Raspberry Seltzer 115 7
45 European Cucumber 83 4
46 Raisin Cinnamon Bagels 5 count 58 1
47 Onion Flavor Organic Roasted Seaweed Snack 66 6
48 School Glue, Washable, No Run 87 17
49 Vegetarian Grain Meat Sausages Italian - 4 CT 14 20
50 Pumpkin Muffin Mix 105 13
Table 4 - order_products_train
#> [1] 0
The order_products_train table
order_id product_id add_to_cart_order reordered
1 49302 1 1
1 11109 2 1
1 10246 3 0
1 49683 4 0
1 43633 5 1
1 13176 6 0
1 47209 7 0
1 22035 8 1
36 39612 1 0
36 19660 2 1
36 49235 3 0
36 43086 4 1
36 46620 5 1
36 34497 6 1
36 48679 7 1
36 46979 8 1
38 11913 1 0
38 18159 2 0
38 4461 3 0
38 21616 4 1
38 23622 5 0
38 32433 6 0
38 28842 7 0
38 42625 8 0
38 39693 9 0
96 20574 1 1
96 30391 2 0
96 40706 3 1
96 25610 4 0
96 27966 5 1
96 24489 6 1
96 39275 7 1
98 8859 1 1
98 19731 2 1
98 43654 3 1
98 13176 4 1
98 4357 5 1
98 37664 6 1
98 34065 7 1
98 35951 8 1
98 43560 9 1
98 9896 10 1
98 27509 11 1
98 15455 12 1
98 27966 13 1
98 47601 14 1
98 40396 15 1
98 35042 16 1
98 40986 17 1
98 1939 18 1
Table 5 - purchase time per order table
#> [1] 206209
#> [1] 206209
The purchase time per order table
order_id order_dow order_hour_of_day
1187899 4 8
1492625 1 11
2196797 0 11
525192 2 11
880375 1 14
1094988 6 10
1822501 0 19
1827621 0 21
2316178 2 19
2180313 3 10
2461523 6 9
1854765 1 12
3402036 1 12
965160 0 16
2614670 5 14
3110252 4 11
62370 2 13
698604 4 13
1524161 0 13
3173750 0 9
2032076 0 20
2803975 0 11
1864787 5 11
2436259 0 12
1947848 4 20
2906490 4 22
2924697 5 18
519514 4 12
1750084 3 9
1647290 4 16
3088145 2 10
39325 2 18
13318 1 9
1651215 0 12
1019719 2 12
2989905 6 8
2639013 0 13
1072954 6 17
34647 3 19
2757217 0 11
669729 5 12
3038639 5 13
2608424 2 14
482516 4 7
3294399 4 8
1700658 6 11
21708 0 6
2178718 2 8
1734166 5 18
859654 1 10

In the left chart (order_dow), we observe that orders are most frequent on Sundays and Mondays compared with the rest of the week. In the right chart (order_hour_of_day), we note a high demand between 9am and 6pm.
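
The counts behind these two charts can be sketched as follows. This is a minimal illustration in Python, not the code used for the report: it uses only the first ten rows of Table 5, and it assumes that order_dow = 0 stands for Sunday, which is how the report reads the variable but is not stated in the data description.

```python
from collections import Counter

# (order_id, order_dow, order_hour_of_day): first ten rows of Table 5.
rows = [
    (1187899, 4, 8), (1492625, 1, 11), (2196797, 0, 11), (525192, 2, 11),
    (880375, 1, 14), (1094988, 6, 10), (1822501, 0, 19), (1827621, 0, 21),
    (2316178, 2, 19), (2180313, 3, 10),
]

# Assumption: 0 = Sunday, ..., 6 = Saturday (not confirmed by the data description).
DAY_NAMES = ["Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"]

# Number of orders per day of the week and per hour of the day.
orders_per_day = Counter(dow for _, dow, _ in rows)
orders_per_hour = Counter(hour for _, _, hour in rows)

# Busiest day, and share of orders placed from 9:00 up to (but not including) 18:00.
busiest_dow, busiest_count = orders_per_day.most_common(1)[0]
daytime_share = sum(1 for _, _, hour in rows if 9 <= hour < 18) / len(rows)

for dow in sorted(orders_per_day):
    print(f"{DAY_NAMES[dow]:<9} {orders_per_day[dow]}")
print("busiest day:", DAY_NAMES[busiest_dow], "| share of orders 9am-6pm:", daytime_share)
```

On the full table, the same two counters would give the bar heights of the left and right charts respectively.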

Table 6 - user_purchases
The user purchases table
order_id order_dow order_hour_of_day aisle_id aisle department_id department
1187899 4 8 77 soft drinks 7 beverages
1187899 4 8 21 packaged cheese 16 dairy eggs
1187899 4 8 120 yogurt 16 dairy eggs
1187899 4 8 54 paper goods 17 household
1187899 4 8 45 candy chocolate 19 snacks
1187899 4 8 117 nuts seeds dried fruit 19 snacks
1187899 4 8 121 cereal 14 breakfast
1187899 4 8 23 popcorn jerky 19 snacks
1187899 4 8 84 milk 16 dairy eggs
1187899 4 8 53 cream 16 dairy eggs
1187899 4 8 77 soft drinks 7 beverages
1492625 1 11 96 lunch meat 20 deli
1492625 1 11 58 frozen breads doughs 1 frozen
1492625 1 11 107 chips pretzels 19 snacks
1492625 1 11 23 popcorn jerky 19 snacks
1492625 1 11 24 fresh fruits 4 produce
1492625 1 11 24 fresh fruits 4 produce
1492625 1 11 24 fresh fruits 4 produce
1492625 1 11 24 fresh fruits 4 produce
1492625 1 11 24 fresh fruits 4 produce
1492625 1 11 24 fresh fruits 4 produce
1492625 1 11 24 fresh fruits 4 produce
1492625 1 11 91 soy lactosefree 16 dairy eggs
1492625 1 11 46 mint gum 19 snacks
1492625 1 11 96 lunch meat 20 deli
1492625 1 11 80 deodorants 11 personal care
1492625 1 11 1 prepared soups salads 20 deli
1492625 1 11 38 frozen meals 1 frozen
1492625 1 11 38 frozen meals 1 frozen
1492625 1 11 38 frozen meals 1 frozen
1492625 1 11 38 frozen meals 1 frozen
1492625 1 11 38 frozen meals 1 frozen
1492625 1 11 38 frozen meals 1 frozen
1492625 1 11 38 frozen meals 1 frozen
1492625 1 11 69 soup broth bouillon 15 canned goods
1492625 1 11 37 ice cream ice 1 frozen
1492625 1 11 37 ice cream ice 1 frozen
1492625 1 11 37 ice cream ice 1 frozen
1492625 1 11 117 nuts seeds dried fruit 19 snacks
1492625 1 11 3 energy granola bars 19 snacks
1492625 1 11 69 soup broth bouillon 15 canned goods
1492625 1 11 69 soup broth bouillon 15 canned goods
2196797 0 11 29 honeys syrups nectars 13 pantry
2196797 0 11 24 fresh fruits 4 produce
2196797 0 11 21 packaged cheese 16 dairy eggs
2196797 0 11 66 asian foods 6 international
2196797 0 11 101 air fresheners candles 17 household
2196797 0 11 83 fresh vegetables 4 produce
2196797 0 11 66 asian foods 6 international
2196797 0 11 123 packaged vegetables fruits 4 produce
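
A table like Table 6 can be assembled by following the foreign keys listed in the data description: each order line points to a product, and each product points to an aisle and a department. The sketch below illustrates that chain of lookups in plain Python. The product, aisle and department rows are copied from Tables 1 to 3; the order 1001 and its order lines are hypothetical and only serve the illustration. It is not the code used to build the report's table.

```python
# Lookup tables keyed by identifier (rows copied from Tables 1, 2 and 3).
aisles = {61: "cookies cakes", 94: "tea", 104: "spices seasonings"}
departments = {7: "beverages", 13: "pantry", 19: "snacks"}
products = {
    1: {"product_name": "Chocolate Sandwich Cookies", "aisle_id": 61, "department_id": 19},
    2: {"product_name": "All-Seasons Salt", "aisle_id": 104, "department_id": 13},
    3: {"product_name": "Robust Golden Unsweetened Oolong Tea", "aisle_id": 94, "department_id": 7},
}

# Hypothetical order header and order lines, invented for illustration only.
orders = {1001: {"order_dow": 4, "order_hour_of_day": 8}}
order_products = [
    {"order_id": 1001, "product_id": 1, "add_to_cart_order": 1, "reordered": 0},
    {"order_id": 1001, "product_id": 3, "add_to_cart_order": 2, "reordered": 1},
]


def join_order_lines(order_products, orders, products, aisles, departments):
    """Resolve each foreign key in turn and return one flat row per order line."""
    rows = []
    for line in order_products:
        order = orders[line["order_id"]]        # order_id -> purchase time
        product = products[line["product_id"]]  # product_id -> aisle_id, department_id
        rows.append({
            "order_id": line["order_id"],
            "order_dow": order["order_dow"],
            "order_hour_of_day": order["order_hour_of_day"],
            "aisle_id": product["aisle_id"],
            "aisle": aisles[product["aisle_id"]],
            "department_id": product["department_id"],
            "department": departments[product["department_id"]],
        })
    return rows


joined = join_order_lines(order_products, orders, products, aisles, departments)
for row in joined:
    print(row)
```

Each output row has the same columns as Table 6, with one row per product in the order.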

Visualization

Top 10 aisles by number of purchases

The top 10 aisles by number of purchases
aisle                          department  total_order
fresh vegetables               produce     150609
fresh fruits                   produce     150473
packaged vegetables fruits     produce     78493
yogurt                         dairy eggs  55240
packaged cheese                dairy eggs  41699
water seltzer sparkling water  beverages   36617
milk                           dairy eggs  32644
chips pretzels                 snacks      31269
soy lactosefree                dairy eggs  26240
bread                          bakery      23635

Number of purchases by department

The top 10 departments by number of purchases
department       total_order
produce          409087
dairy eggs       217051
snacks           118862
beverages        114046
frozen           100426
pantry           81242
bakery           48394
canned goods     46799
deli             44291
dry goods pasta  38713

Sales Patterns
Here, we look at the sales patterns in more depth by splitting the data by department. We start with the weekly sales pattern.


From these graphs, we observe the following patterns:

  1. Although the graph shown at the beginning illustrates that purchases usually peak on Sunday and Monday, alcohol is an exception. For alcohol, the figure increases slightly from its trough on Monday, reaches its peak on Friday, and then decreases sharply on Saturday.

  2. The other departments share a similar pattern. The figures decrease from a peak on Sunday and then start increasing again on Friday.

PCA

We analyze the association between the numbers of orders from different departments.

PCA by department

PCA summarizes the similarity between variables. It can be based on one of two matrices: the correlation matrix (scaled) or the covariance matrix (not scaled). In our analysis, we focus on the relationship between the number of orders from each department and the day of the week on which users purchase. Thus, we focus on the non-scaled PCA, i.e. the one using covariance. However, it is also interesting to see how the results differ between the scaled and non-scaled PCAs, so we also perform the PCA with correlations.
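
The effect of scaling can be illustrated on a toy case with two variables, where the eigenvalues of a symmetric 2x2 matrix have a closed form. The sketch below is written in plain Python under stated assumptions: the per-order counts for "produce" and "pets" are hypothetical, and the code does not reproduce the PCA of this report. It only shows that a covariance-based PCA is dominated by the variable with the largest variance, while a correlation-based PCA gives both variables the same weight.

```python
import math


def covariance(x, y):
    """Sample covariance of two equally long sequences (n - 1 denominator)."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (len(x) - 1)


def eigenvalues_2x2(a, b, d):
    """Eigenvalues of the symmetric matrix [[a, b], [b, d]], largest first."""
    centre = (a + d) / 2
    radius = math.sqrt(((a - d) / 2) ** 2 + b ** 2)
    return centre + radius, centre - radius


# Hypothetical number of items per order in two departments.
produce = [0, 2, 5, 9, 12, 3]   # large spread
pets = [0, 1, 0, 1, 0, 1]       # small spread

var_produce = covariance(produce, produce)
var_pets = covariance(pets, pets)
cov_pp = covariance(produce, pets)
corr_pp = cov_pp / math.sqrt(var_produce * var_pets)

# Non-scaled PCA: eigenvalues of the covariance matrix.
cov_eig = eigenvalues_2x2(var_produce, cov_pp, var_pets)
# Scaled PCA: eigenvalues of the correlation matrix (unit variances on the diagonal).
cor_eig = eigenvalues_2x2(1.0, corr_pp, 1.0)

cov_share = cov_eig[0] / sum(cov_eig)   # share of variance of the first component
cor_share = cor_eig[0] / sum(cor_eig)

print(f"first component, covariance PCA:  {100 * cov_share:.1f}% of variance")
print(f"first component, correlation PCA: {100 * cor_share:.1f}% of variance")
```

In this toy case the first covariance-based component essentially reproduces the high-variance variable, whereas the correlation-based components are close to an even split because the two variables are only weakly correlated.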

Non-scaled PCA (Covariance)

We observe that the first and second components explain 46.69% and 13.76% of the variance of the data. Following the rule of thumb of keeping enough dimensions to explain at least 75% of the variation, we select comp 1 to comp 5, which together explain around 79.8% of the variance of the data.
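
The 75% rule of thumb can be checked directly from the eigenvalues printed in the output of this section. The sketch below is a minimal illustration in plain Python; the function names are ours and are not part of any PCA library.

```python
# Eigenvalues of the non-scaled PCA, copied from the output of this section.
eigenvalues = [
    12.6419, 3.7269, 2.1298, 1.5768, 1.5351, 1.1152, 0.6467,
    0.6098, 0.5149, 0.4686, 0.4194, 0.3796, 0.3128, 0.2797,
    0.2621, 0.1236, 0.1212, 0.1040, 0.0833, 0.0146, 0.0107,
]


def cumulative_variance_percent(eigenvalues):
    """Cumulative percentage of variance explained by the first k components."""
    total = sum(eigenvalues)
    running = 0.0
    cumulative = []
    for value in eigenvalues:
        running += value
        cumulative.append(100 * running / total)
    return cumulative


def components_needed(eigenvalues, threshold_percent=75.0):
    """Smallest number of components whose cumulative share reaches the threshold."""
    for k, share in enumerate(cumulative_variance_percent(eigenvalues), start=1):
        if share >= threshold_percent:
            return k
    return len(eigenvalues)


cumulative = cumulative_variance_percent(eigenvalues)
n_components = components_needed(eigenvalues)
print(n_components, round(cumulative[n_components - 1], 1))
```

With these eigenvalues the threshold is first reached at the fifth component, because the first four components explain about 74.1% of the variance and therefore fall just short of 75%.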

Our findings:

  1. Produce has the highest variation. It is also highly positively correlated with Dim1 and negatively correlated with Dim2.

  2. Most of the other departments, including the variables with the second to sixth largest variance (dairy eggs, snacks, frozen, beverages and pantry), are positively correlated with Dim1 and Dim2. A few small departments (household, personal care, alcohol and pets) have slightly negative coordinates on Dim1.

#>         eigenvalue percentage of variance
#> comp 1     12.6419                46.6895
#> comp 2      3.7269                13.7642
#> comp 3      2.1298                 7.8658
#> comp 4      1.5768                 5.8236
#> comp 5      1.5351                 5.6694
#> comp 6      1.1152                 4.1186
#> comp 7      0.6467                 2.3883
#> comp 8      0.6098                 2.2523
#> comp 9      0.5149                 1.9017
#> comp 10     0.4686                 1.7308
#> comp 11     0.4194                 1.5490
#> comp 12     0.3796                 1.4018
#> comp 13     0.3128                 1.1552
#> comp 14     0.2797                 1.0329
#> comp 15     0.2621                 0.9681
#> comp 16     0.1236                 0.4563
#> comp 17     0.1212                 0.4476
#> comp 18     0.1040                 0.3840
#> comp 19     0.0833                 0.3076
#> comp 20     0.0146                 0.0538
#> comp 21     0.0107                 0.0397
#>         cumulative percentage of variance
#> comp 1                               46.7
#> comp 2                               60.5
#> comp 3                               68.3
#> comp 4                               74.1
#> comp 5                               79.8
#> comp 6                               83.9
#> comp 7                               86.3
#> comp 8                               88.6
#> comp 9                               90.5
#> comp 10                              92.2
#> comp 11                              93.8
#> comp 12                              95.2
#> comp 13                              96.3
#> comp 14                              97.3
#> comp 15                              98.3
#> comp 16                              98.8
#> comp 17                              99.2
#> comp 18                              99.6
#> comp 19                              99.9
#> comp 20                             100.0
#> comp 21                             100.0
#>                    Dim.1     Dim.2    Dim.3     Dim.4     Dim.5
#> canned goods     0.22151  1.20e-01  0.00149  0.122510 -0.000978
#> dairy eggs       0.89270  1.35e+00 -0.89979 -0.249501  0.003429
#> produce          3.38519 -5.46e-01  0.12652 -0.018243  0.027935
#> beverages        0.08314  5.02e-01  0.50253 -0.087091  1.104763
#> deli             0.16873  1.60e-01  0.03477  0.037581 -0.022161
#> frozen           0.26363  5.67e-01  0.19967  1.133036 -0.108314
#> pantry           0.27934  2.71e-01  0.02251  0.149302 -0.010888
#> snacks           0.27092  8.93e-01  1.00108 -0.410478 -0.539367
#> bakery           0.15404  2.03e-01 -0.00956  0.036576 -0.005598
#> household       -0.01834  1.16e-01  0.06346  0.031781  0.091937
#> meat seafood     0.12813  6.36e-02 -0.01244  0.038850 -0.002224
#> personal care   -0.00385  5.93e-02  0.03544  0.020033  0.034508
#> dry goods pasta  0.16510  1.59e-01 -0.00625  0.103859 -0.016606
#> babies           0.05536  6.88e-02 -0.02009  0.008534 -0.014092
#> missing          0.03241  2.57e-02  0.00731  0.008221  0.003323
#> other            0.00254  3.32e-03  0.00223  0.001340  0.001498
#> breakfast        0.07014  1.66e-01  0.04165  0.002389 -0.015789
#> international    0.05296  2.64e-02  0.00637  0.020592 -0.002705
#> alcohol         -0.02207  1.25e-03  0.00601 -0.000461  0.005266
#> bulk             0.00743  7.95e-05  0.00184 -0.001773 -0.001129
#> pets            -0.00531  1.53e-02  0.00650  0.008148  0.008623

Scaled PCA (Correlation)

We find that the first and second components explain only 13.6% and 6.6% of the variance respectively, and we need 14 components (out of 21) to explain at least 75% of the variation. This means that the correlations between departments are very low, so PCA cannot meaningfully reduce the dimensionality of the scaled data.

#>         eigenvalue percentage of variance
#> comp 1       2.861                  13.62
#> comp 2       1.382                   6.58
#> comp 3       1.167                   5.55
#> comp 4       1.049                   4.99
#> comp 5       1.035                   4.93
#> comp 6       1.008                   4.80
#> comp 7       0.990                   4.71
#> comp 8       0.972                   4.63
#> comp 9       0.944                   4.49
#> comp 10      0.931                   4.43
#> comp 11      0.903                   4.30
#> comp 12      0.874                   4.16
#> comp 13      0.871                   4.15
#> comp 14      0.839                   4.00
#> comp 15      0.807                   3.84
#> comp 16      0.791                   3.77
#> comp 17      0.772                   3.67
#> comp 18      0.760                   3.62
#> comp 19      0.736                   3.51
#> comp 20      0.714                   3.40
#> comp 21      0.595                   2.83
#>         cumulative percentage of variance
#> comp 1                               13.6
#> comp 2                               20.2
#> comp 3                               25.8
#> comp 4                               30.8
#> comp 5                               35.7
#> comp 6                               40.5
#> comp 7                               45.2
#> comp 8                               49.8
#> comp 9                               54.3
#> comp 10                              58.8
#> comp 11                              63.1
#> comp 12                              67.2
#> comp 13                              71.4
#> comp 14                              75.4
#> comp 15                              79.2
#> comp 16                              83.0
#> comp 17                              86.6
#> comp 18                              90.3
#> comp 19                              93.8
#> comp 20                              97.2
#> comp 21                             100.0